{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# K-Fold Cross Validation"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's revisit the Iris data set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.model_selection import cross_val_score, train_test_split\n",
    "from sklearn import datasets\n",
    "from sklearn import svm\n",
    "\n",
    "iris = datasets.load_iris()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "A single train/test split is made easy with the train_test_split function in the cross_validation library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.96666666666666667"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Split the iris data into train/test data sets with 40% reserved for testing\n",
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)\n",
    "\n",
    "# Build an SVC model for predicting iris classifications using training data\n",
    "clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)\n",
    "\n",
    "# Now measure its performance with the test data\n",
    "clf.score(X_test, y_test)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "K-Fold cross validation is just as easy; let's use a K of 5:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.96666667  1.          0.96666667  0.96666667  1.        ]\n",
      "0.98\n"
     ]
    }
   ],
   "source": [
    "# We give cross_val_score a model, the entire data set and its \"real\" values, and the number of folds:\n",
    "scores = cross_val_score(clf, iris.data, iris.target, cv=5)\n",
    "\n",
    "# Print the accuracy for each fold:\n",
    "print(scores)\n",
    "\n",
    "# And the mean accuracy of all 5 folds:\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Our model is even better than we thought! Can we do better? Let's try a different kernel (poly):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 1.          1.          0.9         0.93333333  1.        ]\n",
      "0.966666666667\n"
     ]
    }
   ],
   "source": [
    "clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)\n",
    "scores = cross_val_score(clf, iris.data, iris.target, cv=5)\n",
    "print(scores)\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "No! The more complex polynomial kernel produced lower accuracy than a simple linear kernel. The polynomial kernel is overfitting. But we couldn't have told that with a single train/test split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.96666666666666667"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Build an SVC model for predicting iris classifications using training data\n",
    "clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)\n",
    "\n",
    "# Now measure its performance with the test data\n",
    "clf.score(X_test, y_test)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "That's the same score we got with a single train/test split on the linear kernel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Activity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The \"poly\" kernel for SVC actually has another attribute for the number of degrees of the polynomial used, which defaults to 3. For example, svm.SVC(kernel='poly', degree=3, C=1)\n",
    "\n",
    "We think the default third-degree polynomial is overfitting, based on the results above. But how about 2? Give that a try and compare it to the linear kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
